Group Task

Authors

Marco Boso 100535153

Diego Paroli 100554973

Yijia Lin 100452242

Bradley McKenzie 100535241

Linghan Zheng 100540803

Jia Lin 100536210

Isabel Monge 100542532

Objectives and mandatory items

The objective of the delivery is to perform an analysis of the electoral data, carrying out the debugging, summaries and graphs you consider, both of their results and the accuracy of the electoral polls.

Specifically, you must work only in the time window that includes the elections from 2008 to the last elections of 2019.

General comments

In addition to what you see fit to execute, the following items are mandatory:

  • Each group should present before 9th January (23:59) an analysis of the data in .qmd and .html format in Quarto slides mode, which will be the ones they will present on the day of the presentation.

  • Quarto slides should be uploaded to Github (the link should be provided by a member of each group).

  • The maximum number of slides should be 40. The maximum time for each group will be 20-22 minutes (+5 minutes for questions).

  • During the presentation you will explain (summarised!) the analysis performed so that each team member speaks for a similar amount of time and each member can be asked about any of the steps. The grade does not have to be the same for all members.

  • It will be valued not only the content but also the container (aesthetics).

  • The objective is to demonstrate that the maximum knowledge of the course has been acquired: the more content of the syllabus is included, the better.

Mandatory items:

  1. Data should be converted to tidydata where appropriate.

  2. You should include at least one join between tables.

  3. Reminder: information = variance, so remove columns that are not going to contribute anything.

  4. The glue and lubridate packages should be used at some point, as well as the forcats. The use of ggplot2 will be highly valued.

  5. The following should be used at least once:

    • mutate
    • summarise
    • group_by (or equivalent)
    • case_when
  6. We have many, many parties running for election. We will only be interested in the following parties:

    • PARTIDO SOCIALISTA OBRERO ESPAÑOL (beware: it has/had federations - branches - with some other name).
    • PARTIDO POPULAR
    • CIUDADANOS (caution: has/had federations - branches - with some other name)
    • PARTIDO NACIONALISTA VASCO
    • BLOQUE NACIONALISTA GALLEGO
    • CONVERGÈNCIA I UNIÓ
    • UNIDAS PODEMOS - IU (beware that here they have had various names - IU, podem, ezker batua, …- and have not always gone together, but here we will analyze them together)
    • ESQUERRA REPUBLICANA DE CATALUNYA
    • EH - BILDU (are now a coalition of parties formed by Sortu, Eusko Alkartasuna, Aralar, Alternatiba)
    • MÁS PAÍS
    • VOX
  7. Anything other than any of the above parties should be imputed as “OTHER”. Remember to add properly the data after the previous recoding.

  8. Party acronyms will be used for the visualizations. The inclusion of graphics will be highly valued (see https://r-graph-gallery.com/).

  9. You must use all 4 data files at some point.

  10. You must define at least one (non-trivial) function of your own.

  11. You will have to discard mandatory polls that:

-   refer to elections before 2008
-   that are exit polls
-   have a sample size of less than 750 or are unknown
-   that have less than 1 or less fieldwork days
  1. You must obligatorily answer the following questions (plus those that you consider analyzing to distinguish yourself from the rest of the teams, either numerically and/or graphically)
-   Which party was the winner in the municipalities with more than 100,000 habitants (census) in each of the elections?
-   Which party was the second when the first was the PSOE? And when the first was the PP?
-   Who benefits from low turnout?
-   How to analyze the relationship between census and vote? Is it true that certain parties win in rural areas?
-   How to calibrate the error of the polls (remember that the polls are voting intentions at national level)?
-   Which polling houses got it right the most and which ones deviated the most from the results?

You should include at least 3 more “original” questions that you think that it could be interesting to be answer with the data.

Marks

The one who does the most things will not be valued the most. More is not always better. The originality (with respect to the rest of the works, for example in the analyzed or in the subject or …) of what has been proposed, in the handling of tables (or in the visualization), the caring put in the delivery (care in life is important) and the relevance of what has been done will be valued. Once you have the mandatory items with your database more or less completed, think before chopping code: what could be interesting? What do I need to get a summary both numerical and visual?

Remember that the real goal is to demonstrate a mastery of the tools seen throughout the course. And that happens not only by the quantity of them used but also by the quality when executing them.

Some dataviz will be extremely positive valued.

Required packages

Insert in the lower chunk the packages you will need

rm(list = ls())
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(DataExplorer)
library(glue)
library(ggplot2)
library(forcats)
library(ggrepel)
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(ggpubr)

Data

The practice will be based on the electoral data archives below, compiling data on elections to the Spanish Congress of Deputies from 2008 to the present, as well as surveys, municipalities codes and abbreviations.

# NO TOQUES NADA
election_data <- read_csv(file = "./data/datos_elecciones_brutos.csv")
Rows: 48737 Columns: 471
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): tipo_eleccion, mes, codigo_ccaa, codigo_provincia, codigo_municipio
dbl (424): anno, vuelta, codigo_distrito_electoral, numero_mesas, censo, par...
lgl  (42): FALANGE ESPAÑOLA DE LA JONS, PARTIDO COMUNISTA DEL PUEBLO CASTELL...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cod_mun <- read_csv(file = "./data/cod_mun.csv")
Rows: 8135 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): cod_mun, municipio

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
surveys <- read_csv(file = "./data/historical_surveys.csv")
Rows: 3753 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (4): type_survey, id_pollster, pollster, media
dbl  (51): size, turnout, UCD, PSOE, PCE, AP, CIU, PA, EAJ-PNV, HB, ERC, EE,...
lgl   (1): exit_poll
date  (3): date_elec, field_date_from, field_date_to

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
abbrev <- read_csv(file = "./data/siglas.csv")
Rows: 587 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): denominacion, siglas

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data will be as follows:

  • election_data: file with election data for Congress from 2018 to the last ones in 2019.

    • tipo_eleccion: type of election (02 if congressional election)
    • anno, mes: year and month of elections
    • vuelta: electoral round (1 if first round)
    • codigo_ccaa, codigo_provincia, codigo_municipio, codigo_distrito_electoral: code of the ccaa, province, municipality and electoral district.
    • numero_mesas: number of polling stations
    • censo: census
    • participacion_1, participacion_2: participation in the first preview (14:00) and second preview (18:00) before polls close (20:00)
    • votos_blancos: blank ballots
    • votos_candidaturas: party ballots
    • votos_nulos: null ballots
    • ballots for each party
  • cod_mun: file with the codes and names of each municipality

  • abbrev: acronyms and names associated with each party

  • surveys: table of electoral polls since 1982. Some of the variables are the following:

    • type_survey: type of survey (national, regional, etc.)
    • date_elec: date of future elections
    • id_pollster, pollster, media: id and name of the polling company, as well as the media that commissioned it.
    • field_date_from, field_date_to: start and end date of fieldwork
    • exit_poll: whether it is an exit poll or not
    • size: sample size
    • turnout: turnout estimate
    • estimated voting intentions for the main parties

Cleaning the data – surveys

# Filter dataset
cleaned_surveys <- surveys |>
  mutate(
    # Parse dates variables as date objects
    field_date_from = ymd(field_date_from),
    field_date_to = ymd(field_date_to),
    date_elec = ymd(date_elec),
    # Calculate the number of fieldwork days
    fieldwork_days = as.numeric(field_date_to - field_date_from + 1)
  ) |>
  filter(
    !exit_poll,                           # Exclude exit polls
    date_elec >= ymd("2008-01-01"),       # Exclude polls referred to elections before 2008
    size >= 750,                          # Exclude polls with sample size < 750
    fieldwork_days > 1                    # Exclude polls with 1 or fewer fieldwork days
  )

# Deleting columns that only have NAs
cleaned_surveys <- cleaned_surveys |> 
  select(where(~ !all(is.na(.))))

# Identify party columns dynamically
metadata_columns <- c("type_survey", "date_elec", "id_pollster", "pollster", "media",
                      "field_date_from", "field_date_to", "fieldwork_days", "exit_poll", 
                      "size", "turnout")
party_columns <- setdiff(colnames(cleaned_surveys), metadata_columns)

# Reshape data into long format
tidy_surveys <- cleaned_surveys |>
  pivot_longer(
    cols = all_of(party_columns),  # Reshape party columns
    names_to = "party_raw",        # Raw party names
    values_to = "votes"            # Corresponding voting intentions
  )

# add on party names by code
tidy_surveys <- tidy_surveys %>%
  mutate(
    party = case_when(
      party_raw == "PSOE" ~ "PARTIDO SOCIALISTA OBRERO ESPAÑOL",
      party_raw == "CIU" ~ "CONVERGÈNCIA I UNIÓ",
      party_raw == "EAJ-PNV" ~ "PARTIDO NACIONALISTA VASCO",
      party_raw == "ERC" ~ "ESQUERRA REPUBLICANA DE CATALUNYA",
      party_raw == "IU" ~ "UNIDAS PODEMOS - IU",
      party_raw == "UP" ~ "UNIDAS PODEMOS - IU",
      party_raw == "PP" ~ "PARTIDO POPULAR",
      party_raw == "BNG" ~ "BLOQUE NACIONALISTA GALLEGO",
      party_raw == "CS" ~ "CIUDADANOS",
      party_raw == "EH-BILDU" ~ "EH - BILDU",
      party_raw == "PODEMOS" ~ "UNIDAS PODEMOS - IU",
      party_raw == "VOX" ~ "VOX",
      party_raw == "MP" ~ "MÁS PAÍS",
      TRUE ~ "OTHER")
  )

# Create a column for proper, unqique acronyms
tidy_surveys <- tidy_surveys |> 
  mutate(
    party_code = case_when(
      party == "UNIDAS PODEMOS - IU"~ "PODEMOS-IU",
      party == "OTHER"~ "OTHER",
      TRUE ~ party_raw)
  )

# Select relevant columns
# Getting rid of type_survey, exit_poll (take only 1 value), party_raw
final_surveys <- tidy_surveys |>
  select(-type_survey, -exit_poll, -party_raw) |> 
  relocate(fieldwork_days, .after = field_date_to) |> 
  relocate(votes, .after = party_code) 

# Summing all votes based on the party reclassification
final_surveys <- final_surveys |> 
  group_by(across(-votes)) |> 
  summarize(votes = sum(votes, na.rm = TRUE), .groups = "drop") |> 
  arrange(field_date_from)
# We have 1614 surveys (rows from cleaned_surveys), 12 parties (meaning 12 rows per survey). Thus 1614x12=19368 rows

# Preview
final_surveys
# A tibble: 19,368 × 12
   date_elec  id_pollster pollster media field_date_from field_date_to
   <date>     <chr>       <chr>    <chr> <date>          <date>       
 1 2008-03-09 pollster-6  GALLUP   <NA>  2004-04-01      2004-04-21   
 2 2008-03-09 pollster-6  GALLUP   <NA>  2004-04-01      2004-04-21   
 3 2008-03-09 pollster-6  GALLUP   <NA>  2004-04-01      2004-04-21   
 4 2008-03-09 pollster-6  GALLUP   <NA>  2004-04-01      2004-04-21   
 5 2008-03-09 pollster-6  GALLUP   <NA>  2004-04-01      2004-04-21   
 6 2008-03-09 pollster-6  GALLUP   <NA>  2004-04-01      2004-04-21   
 7 2008-03-09 pollster-6  GALLUP   <NA>  2004-04-01      2004-04-21   
 8 2008-03-09 pollster-6  GALLUP   <NA>  2004-04-01      2004-04-21   
 9 2008-03-09 pollster-6  GALLUP   <NA>  2004-04-01      2004-04-21   
10 2008-03-09 pollster-6  GALLUP   <NA>  2004-04-01      2004-04-21   
# ℹ 19,358 more rows
# ℹ 6 more variables: fieldwork_days <dbl>, size <dbl>, turnout <dbl>,
#   party <chr>, party_code <chr>, votes <dbl>

Creating table for party codes

Creating a table to link each party name to its unique code

party_info <- final_surveys |> 
  select(party, party_code) |> 
  unique()

Cleaning the data – election_data

The election_data file is large and requires quite extensive cleaning to make it “tidy”. We will tidy the data to try make it most useful for future analysis. The election data starts off with 48,737 rows and 471 columns. Reducing the number of columns is a clear priority.

First, we look at the quality of the data and see if any information is redundant and can be removed.

plot_intro(election_data)

# We see 1.9% missing colums, identify the cols with no data - we have 9 cols. 
blank_cols <- names(election_data)[sapply(election_data, function(x) all(is.na(x)))]

# Drop these columns and also filter to ensure no info outside 2008 to 2019 is included. 
election_data <- election_data |> 
  select(-all_of(blank_cols)) |> 
  filter(anno >= 2008 & anno <= 2019)

# Drop columns that are logical
election_data <- election_data %>%
  select(where(~ !is.logical(.)))
# See the improvements
plot_intro(election_data)

Second, we begin to make the election data tidy. We start by pivoting the data so all columns for party names are within one “party” variable. Before this we have 414 columns referring to parties.

# Pivot all the party names and ballot counts to the main table
election_pivot <- election_data |> 
  pivot_longer(
    cols = `BERDEAK-LOS VERDES`:`COALICIÓN POR MELILLA`, # select all party data
    names_to = "party",
    values_to = "ballots"
  )
str(election_pivot)
tibble [20,177,118 × 17] (S3: tbl_df/tbl/data.frame)
 $ tipo_eleccion            : chr [1:20177118] "02" "02" "02" "02" ...
 $ anno                     : num [1:20177118] 2008 2008 2008 2008 2008 ...
 $ mes                      : chr [1:20177118] "03" "03" "03" "03" ...
 $ vuelta                   : num [1:20177118] 1 1 1 1 1 1 1 1 1 1 ...
 $ codigo_ccaa              : chr [1:20177118] "14" "14" "14" "14" ...
 $ codigo_provincia         : chr [1:20177118] "01" "01" "01" "01" ...
 $ codigo_municipio         : chr [1:20177118] "001" "001" "001" "001" ...
 $ codigo_distrito_electoral: num [1:20177118] 0 0 0 0 0 0 0 0 0 0 ...
 $ numero_mesas             : num [1:20177118] 2 2 2 2 2 2 2 2 2 2 ...
 $ censo                    : num [1:20177118] 1838 1838 1838 1838 1838 ...
 $ participacion_1          : num [1:20177118] 677 677 677 677 677 677 677 677 677 677 ...
 $ participacion_2          : num [1:20177118] 1008 1008 1008 1008 1008 ...
 $ votos_blancos            : num [1:20177118] 23 23 23 23 23 23 23 23 23 23 ...
 $ votos_nulos              : num [1:20177118] 13 13 13 13 13 13 13 13 13 13 ...
 $ votos_candidaturas       : num [1:20177118] 1269 1269 1269 1269 1269 ...
 $ party                    : chr [1:20177118] "BERDEAK-LOS VERDES" "ARALAR" "PARTIDO OBRERO SOCIALISTA INTERNACIONALISTA" "ALTERNATIVA MOTOR Y DEPORTES" ...
 $ ballots                  : num [1:20177118] 9 27 1 1 2 238 61 85 4 17 ...
head(election_pivot)
# A tibble: 6 × 17
  tipo_eleccion  anno mes   vuelta codigo_ccaa codigo_provincia codigo_municipio
  <chr>         <dbl> <chr>  <dbl> <chr>       <chr>            <chr>           
1 02             2008 03         1 14          01               001             
2 02             2008 03         1 14          01               001             
3 02             2008 03         1 14          01               001             
4 02             2008 03         1 14          01               001             
5 02             2008 03         1 14          01               001             
6 02             2008 03         1 14          01               001             
# ℹ 10 more variables: codigo_distrito_electoral <dbl>, numero_mesas <dbl>,
#   censo <dbl>, participacion_1 <dbl>, participacion_2 <dbl>,
#   votos_blancos <dbl>, votos_nulos <dbl>, votos_candidaturas <dbl>,
#   party <chr>, ballots <dbl>

We now have a table with 20,177,118 rows and 17 columns.

This is more clean than previously, but we still need to aggregate of our party variables into the main party groups. We will do this by creating a mapping table (party_names) that standardizes the raw party names into main party groupings (party_main) using regular expressions.

party_names <- tibble(names = unique(election_pivot$party))

# Party names in the election_data file do not match up perfectly with the abbrev file (i.e. some of the names present in party_names are not in abbrev)
# So it is better to work directly on party_names instead of using abbrev

party_names <- party_names |> 
    mutate(party_main = case_when(
                str_detect(names, "(?i)PSOE|PARTIDO DOS SOCIALISTAS DE GALICIA|PARTIDO SOCIALISTA DE EUSKADI|PARTIDO SOCIALISTA OBRERO ESPAÑOL|PARTIT SOCIALISTA OBRER ESPANYOL|PARTIT DELS SOCIALISTES DE CATALUNYA") ~ "PARTIDO SOCIALISTA OBRERO ESPAÑOL",
                str_detect(names, "(?i)PARTIDO POPULAR") ~ "PARTIDO POPULAR",
                str_detect(names, "(?i)CIUDADANOS-PARTIDO DE LA CIUDADANIA|CIUDADANOS-PARTIDO DE LA CIUDADANÍA|CIUDADANOS PARTIDO DE LA CIUDADANIA|CIUDADANOS PARTIDO DE LA CIUDADANÍA|CIUDADANOS, PARTIDO DE LA CIUDADANÍA|CIUTADANS") ~ "CIUDADANOS",
                str_detect(names, "(?i)EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO") ~ "PARTIDO NACIONALISTA VASCO",
                str_detect(names, "(?i)BLOQUE NACIONALISTA GALEGO|BNG") ~ "BLOQUE NACIONALISTA GALLEGO",
                str_detect(names, "(?i)CONVERGENCIA I UNIO|CONVERGÈNCIA I UNIÓ") ~ "CONVERGÈNCIA I UNIÓ",
                str_detect(names, "(?i)PODEM|EZKER BATUA|EZKER ANITZA|IZQUIERDA UNIDA|ESQUERRA UNIDA|ESQUERDA UNIDA") ~ "UNIDAS PODEMOS - IU",
                str_detect(names, "(?i)ESQUERRA REPUBLICANA") ~ "ESQUERRA REPUBLICANA DE CATALUNYA",
                str_detect(names, "(?i)BILDU|EUSKO ALKARTASUNA|ARALAR|SORTU|ALTERNATIBA") ~ "EH - BILDU",
                str_detect(names, "(?i)MÁS PAÍS") ~ "MÁS PAÍS",
                str_detect(names, "(?i)VOX") ~ "VOX",
                TRUE ~ "OTHER")
    )

unique(party_names$party_main)
 [1] "OTHER"                             "EH - BILDU"                       
 [3] "PARTIDO POPULAR"                   "UNIDAS PODEMOS - IU"              
 [5] "PARTIDO SOCIALISTA OBRERO ESPAÑOL" "PARTIDO NACIONALISTA VASCO"       
 [7] "CIUDADANOS"                        "ESQUERRA REPUBLICANA DE CATALUNYA"
 [9] "CONVERGÈNCIA I UNIÓ"               "BLOQUE NACIONALISTA GALLEGO"      
[11] "VOX"                               "MÁS PAÍS"                         
# Adding party code to party_names dataframe
party_names <- party_names |> 
  left_join(party_info, by = join_by(party_main == party))

Now join on the main party names and codes to our election table. Testing was undertaken and the join of a table was more efficient than alternatives (e.g. str_detects over election_pivot or rowwise summaries).

# Join party main and party code into main df
election_pivot <- election_pivot |> 
  left_join(party_names, by = join_by(party == names))

Now we will include some additional information that will make the analysis potentially easier later, including province and total votes counts from our data:

# Create municipal code to join on municipal names. 
# Create total votes column
election_pivot <- election_pivot |>
  mutate(cod_mun = paste(codigo_ccaa, codigo_provincia, codigo_municipio, sep="-"),
         total_votes = votos_blancos + votos_nulos + votos_candidaturas)

# Join municipality names
election_pivot <- election_pivot |> 
  left_join(cod_mun, by = join_by(cod_mun))  

# Check quality of the join and whether NA's have been introduced as municipality names
any(is.na(election_pivot$municipio))
[1] FALSE

Be careful not all 8135 municipalities appear in each election. We have 6 elections and 414 parties, thus we should have 6x414=2484 occurrences for each municipality, but that is not the case.

Also be careful some municipalities have the same name (but different mun_code), so if you ever need to group by municipality remember to group by mun_code instead of municipality.

# Count the number of times each municipaly appears and then get the unique values for that count (not all are 2484) meaning some municipalities are not present in certain elections
election_pivot |> count(cod_mun) |> pull(n) |> unique()
[1] 2484 1656  828 1242 2070
#Number of unique values for cod_mun is different than number of unique values for municipio
n_distinct(cod_mun$cod_mun)
[1] 8135
n_distinct(cod_mun$municipio)
[1] 8118

Now we need to group together all of the votes for “OTHER” variables and create unique identifiers for each individual election in our dataframes.

Currently we have a table of 22 variables with 20,177,118 rows. We can clean this more.

First, identify the redundant data in our election. We can remove:

tipo_eleccion - because all values = 02. It is not useful vuelta = because all values = 1, it is not useful. geographic variables = we will remove codigo_municipio is included in cod_mun which we joined on from the cod_mun table. We keep the autonomous community and proivnce variables for potential future aggregation and analysis. codigo_distrito_electoral - because every value is zero. It is not useful.

Notably, we have many NA ballot rows and a row for each individual party at each election, where will also try to reduce this when we aggregate the party data with the “party_main” variable created.

# To clean the data more, reduce our dataset and rename key variables so everything is more consistent in English
tidy_election <- election_pivot |> 
  select(year = anno, 
         month = mes,
         code_community = codigo_ccaa,
         code_province = codigo_provincia,
         code_municipality = cod_mun,
         municipality = municipio,
         population = censo,
         polling_stations = numero_mesas,
         participation_1 = participacion_1,
         participation_2 = participacion_2,
         blank_votes = votos_blancos,
         null_votes = votos_nulos,
         valid_votes = votos_candidaturas,
         total_votes,
         party_main,
         party_code,
         ballots)

summary(tidy_election)
      year         month           code_community     code_province     
 Min.   :2008   Length:20177118    Length:20177118    Length:20177118   
 1st Qu.:2011   Class :character   Class :character   Class :character  
 Median :2016   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2015                                                           
 3rd Qu.:2019                                                           
 Max.   :2019                                                           
                                                                        
 code_municipality  municipality         population      polling_stations  
 Length:20177118    Length:20177118    Min.   :      3   Min.   :   1.000  
 Class :character   Class :character   1st Qu.:    144   1st Qu.:   1.000  
 Mode  :character   Mode  :character   Median :    454   Median :   1.000  
                                       Mean   :   4249   Mean   :   7.261  
                                       3rd Qu.:   1858   3rd Qu.:   3.000  
                                       Max.   :2384269   Max.   :3742.000  
                                                                           
 participation_1   participation_2    blank_votes         null_votes      
 Min.   :      0   Min.   :      0   Min.   :    0.00   Min.   :    0.00  
 1st Qu.:     57   1st Qu.:     86   1st Qu.:    1.00   1st Qu.:    1.00  
 Median :    185   Median :    278   Median :    3.00   Median :    4.00  
 Mean   :   1640   Mean   :   2448   Mean   :   28.71   Mean   :   29.84  
 3rd Qu.:    720   3rd Qu.:   1109   3rd Qu.:   12.00   3rd Qu.:   16.00  
 Max.   :1022073   Max.   :1531231   Max.   :17409.00   Max.   :16527.00  
                                                                          
  valid_votes       total_votes       party_main         party_code       
 Min.   :      1   Min.   :      2   Length:20177118    Length:20177118   
 1st Qu.:    106   1st Qu.:    109   Class :character   Class :character  
 Median :    336   Median :    343   Mode  :character   Mode  :character  
 Mean   :   3025   Mean   :   3084                                        
 3rd Qu.:   1364   3rd Qu.:   1393                                        
 Max.   :1847096   Max.   :1872679                                        
                                                                          
    ballots        
 Min.   :     1    
 1st Qu.:     3    
 Median :    15    
 Mean   :   372    
 3rd Qu.:    93    
 Max.   :919701    
 NA's   :19781159  
tidy_election <- tidy_election |> 
  group_by(across(-ballots))|> 
  summarise(party_ballots = sum(ballots, na.rm=TRUE), .groups = "drop")

tidy_election
# A tibble: 584,844 × 17
    year month code_community code_province code_municipality municipality
   <dbl> <chr> <chr>          <chr>         <chr>             <chr>       
 1  2008 03    01             04            01-04-001         Abla        
 2  2008 03    01             04            01-04-001         Abla        
 3  2008 03    01             04            01-04-001         Abla        
 4  2008 03    01             04            01-04-001         Abla        
 5  2008 03    01             04            01-04-001         Abla        
 6  2008 03    01             04            01-04-001         Abla        
 7  2008 03    01             04            01-04-001         Abla        
 8  2008 03    01             04            01-04-001         Abla        
 9  2008 03    01             04            01-04-001         Abla        
10  2008 03    01             04            01-04-001         Abla        
# ℹ 584,834 more rows
# ℹ 11 more variables: population <dbl>, polling_stations <dbl>,
#   participation_1 <dbl>, participation_2 <dbl>, blank_votes <dbl>,
#   null_votes <dbl>, valid_votes <dbl>, total_votes <dbl>, party_main <chr>,
#   party_code <chr>, party_ballots <dbl>

Joining year and month into one variable

final_election <- tidy_election |> 
  mutate(date_elec = glue("{year}-{month}-01")) |> 
  relocate(date_elec, .before = year) |> 
  select(-year, -month)

#Adding correct days to match survey dataframe
final_election <- final_election |> 
  mutate(
    date_elec = ymd(case_when(
      date_elec == "2008-03-01" ~ "2008-03-09",
      date_elec == "2011-11-01" ~ "2011-11-20",
      date_elec == "2015-12-01" ~ "2015-12-20",
      date_elec == "2016-06-01" ~ "2016-06-26",
      date_elec == "2019-04-01" ~ "2019-04-28",
      date_elec == "2019-11-01" ~ "2019-11-10"))
  )

str(final_election)
tibble [584,844 × 16] (S3: tbl_df/tbl/data.frame)
 $ date_elec        : Date[1:584844], format: "2008-03-09" "2008-03-09" ...
 $ code_community   : chr [1:584844] "01" "01" "01" "01" ...
 $ code_province    : chr [1:584844] "04" "04" "04" "04" ...
 $ code_municipality: chr [1:584844] "01-04-001" "01-04-001" "01-04-001" "01-04-001" ...
 $ municipality     : chr [1:584844] "Abla" "Abla" "Abla" "Abla" ...
 $ population       : num [1:584844] 1180 1180 1180 1180 1180 1180 1180 1180 1180 1180 ...
 $ polling_stations : num [1:584844] 2 2 2 2 2 2 2 2 2 2 ...
 $ participation_1  : num [1:584844] 524 524 524 524 524 524 524 524 524 524 ...
 $ participation_2  : num [1:584844] 798 798 798 798 798 798 798 798 798 798 ...
 $ blank_votes      : num [1:584844] 1 1 1 1 1 1 1 1 1 1 ...
 $ null_votes       : num [1:584844] 6 6 6 6 6 6 6 6 6 6 ...
 $ valid_votes      : num [1:584844] 941 941 941 941 941 941 941 941 941 941 ...
 $ total_votes      : num [1:584844] 948 948 948 948 948 948 948 948 948 948 ...
 $ party_main       : chr [1:584844] "BLOQUE NACIONALISTA GALLEGO" "CIUDADANOS" "CONVERGÈNCIA I UNIÓ" "EH - BILDU" ...
 $ party_code       : chr [1:584844] "BNG" "CS" "CIU" "EH-BILDU" ...
 $ party_ballots    : num [1:584844] 0 0 0 0 0 0 19 0 382 512 ...

Election identifiers:

  • Timing -> date
  • Area information -> code_community (autonomous community), code_province, code_municipality, municipality, population
  • General election information -> polling_stations, participation_1, participation_2, blank_votes, null_votes, valid_votes, total_votes
  • Party votes received -> party_main, party_code, party_ballots

Creating turnout dataframe

Creating a dataframe storing all the turnout data for each municipality in each election in case we need to work just on turnout or other data that does not change by party.

All this info is still included in final_election

turnout <- final_election |> 
  select(
    date_elec, code_community, code_province, code_municipality, municipality,
    population, polling_stations, participation_1, participation_2, 
    blank_votes, null_votes, valid_votes, total_votes
  ) |> 
  distinct()
turnout
# A tibble: 48,737 × 13
   date_elec  code_community code_province code_municipality municipality       
   <date>     <chr>          <chr>         <chr>             <chr>              
 1 2008-03-09 01             04            01-04-001         Abla               
 2 2008-03-09 01             04            01-04-002         Abrucena           
 3 2008-03-09 01             04            01-04-003         Adra               
 4 2008-03-09 01             04            01-04-004         Albanchez          
 5 2008-03-09 01             04            01-04-005         Alboloduy          
 6 2008-03-09 01             04            01-04-006         Albox              
 7 2008-03-09 01             04            01-04-007         Alcolea            
 8 2008-03-09 01             04            01-04-008         Alcóntar           
 9 2008-03-09 01             04            01-04-009         Alcudia de Monteag…
10 2008-03-09 01             04            01-04-010         Alhabia            
# ℹ 48,727 more rows
# ℹ 8 more variables: population <dbl>, polling_stations <dbl>,
#   participation_1 <dbl>, participation_2 <dbl>, blank_votes <dbl>,
#   null_votes <dbl>, valid_votes <dbl>, total_votes <dbl>

Recap cleaning

We have 2 primary datasets at this stage, election data and survey data, plus a turnout dataframe which is a subset of the election data. For surveys, the data has been cleaned so each row represents the votes for one party within a specific national poll. For elections, the data has been cleaned so each row represents the number of votes for a party within an election in a specific municipality.

The final_surveys data includes:

  • election date, pollster and media information, fieldwork dates
  • size of the survey and turnout
  • party name, party code
  • votes received (for that party in that poll)

The final_election data includes:

  • date of the election
  • party name, party code (with non-primary parties grouped)
  • identifier for autonomous community, province and municipality
  • municipality population
  • election information such as number of polling stations, votes per session
  • ballots received (for that party per election in each municipality)

The turnout data includes:

  • information on the number of votes and type of vote (e.g. valid or blank/null) per municipality in each election.

!!!!!!! WORK ON DATAFRAMES final_surveys, final_election, turnout !!!!!!!
!!!!!!! DO NOT OVERWRITE THESE DATAFRAMES, CREATE NEW ONES IF YOU NEED TO MODIFY THEM (ex. surveys_q1 <- final_surveys) !!!!!!!

Mandatory questions

1.Which party was the winner in the municipalities with more than 100,000 habitants (census) in each of the elections?

We create a table (winning_table) here that has a column listing each municipality that reported over 100k habitants for at least one election. Then there are columns for each of the elections that report on who won the election and the % of total vote the winning party received.

First, explore the data and see that 48 of the 8,135 municipalities had over 100k people. We use code_municipality as some municipalities have the same name.

# count of provinces to include (for checking)
final_election |> 
  filter(population >100000) |> 
  summarise(num_100k = length(unique(code_municipality))) # we have 48 municipalities 
# A tibble: 1 × 1
  num_100k
     <int>
1       48
# create table to identify winning party only in provinces >100k
winners100k <- final_election |> 
  filter(population > 100000) |> 
  group_by(date_elec, code_municipality, municipality) |> 
  mutate(vote_percent = as.character(glue("{party_main} ({round(party_ballots/valid_votes*100, 1)}%)"))) |> # create name + vote% value. 
  slice_max(order_by = party_ballots, n = 1) |> #select winner only
  ungroup() |> 
  select(date_elec, code_municipality, municipality, population, party_main, party_code, valid_votes, party_ballots, vote_percent)


# Pivot to create summary table of only the municipality, winner and vote getting %
winning_table <- winners100k |> 
  select(municipality, date_elec, vote_percent) |> 
  pivot_wider(names_from = date_elec, 
              values_from = vote_percent)

# print output
winning_table
# A tibble: 48 × 7
   municipality `2008-03-09` `2011-11-20` `2015-12-20` `2016-06-26` `2019-04-28`
   <chr>        <chr>        <chr>        <chr>        <chr>        <chr>       
 1 Almería      PARTIDO POP… PARTIDO POP… PARTIDO POP… PARTIDO POP… PARTIDO SOC…
 2 Cádiz        PARTIDO SOC… PARTIDO POP… UNIDAS PODE… <NA>         <NA>        
 3 Jerez de la… PARTIDO SOC… PARTIDO POP… PARTIDO POP… PARTIDO POP… PARTIDO SOC…
 4 Córdoba      PARTIDO POP… PARTIDO POP… PARTIDO POP… PARTIDO POP… PARTIDO SOC…
 5 Granada      PARTIDO POP… PARTIDO POP… PARTIDO POP… PARTIDO POP… PARTIDO SOC…
 6 Huelva       PARTIDO SOC… PARTIDO POP… PARTIDO SOC… PARTIDO POP… PARTIDO SOC…
 7 Málaga       PARTIDO SOC… PARTIDO POP… PARTIDO POP… PARTIDO POP… PARTIDO SOC…
 8 Sevilla      PARTIDO SOC… PARTIDO POP… PARTIDO POP… PARTIDO POP… PARTIDO SOC…
 9 Zaragoza     PARTIDO SOC… PARTIDO POP… PARTIDO POP… PARTIDO POP… PARTIDO SOC…
10 Gijón        PARTIDO SOC… PARTIDO SOC… UNIDAS PODE… PARTIDO POP… PARTIDO SOC…
# ℹ 38 more rows
# ℹ 1 more variable: `2019-11-10` <chr>

Some of our winning table, have NA values, we confirm that this is because there code was <100k in some elections and over 100k in others:

# Check why some have NA's - pull municipality names
varying_pop_municips <- winning_table |>
  filter_all(any_vars(is.na(.))) |>
  pull(municipality)

# filter for those municipalities to show the population varies +/- 100k
final_election |> 
  filter(municipality %in% varying_pop_municips) |> 
  group_by(municipality, date_elec, population) |> 
  summarise() |> 
  pivot_wider(names_from = date_elec, 
              values_from = population)
`summarise()` has grouped output by 'municipality', 'date_elec'. You can
override using the `.groups` argument.
# A tibble: 2 × 7
# Groups:   municipality [2]
  municipality `2008-03-09` `2011-11-20` `2015-12-20` `2016-06-26` `2019-04-28`
  <chr>               <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
1 Cádiz              106478       103946       100460        99886        97629
2 Dos Hermanas        90887        96443        99637       100095       102177
# ℹ 1 more variable: `2019-11-10` <dbl>

Now we will graph the results:

# Define custom colors for the given party abbreviations
party_colors <- c(
  "PSOE" = "#EF1C27",       # PARTIDO SOCIALISTA OBRERO ESPAÑOL
  "EAJ-PNV" = "#01796F",    # PARTIDO NACIONALISTA VASCO
  "ERC" = "#FFD700",        # ESQUERRA REPUBLICANA DE CATALUNYA
  "PODEMOS-IU" = "#5C2D91", # UNIDAS PODEMOS - IU
  "PP" = "#0454A3",         # PARTIDO POPULAR
  "VOX" = "#63BE21",        # VOX
  "CS" = "#EB6109",         # CIUDADANOS
  "BNG" = "#0093DD",        # BLOQUE NACIONALISTA GALLEGO
  "CIU" = "#FFB232",        # CONVERGÈNCIA I UNIÓ
  "EH-BILDU" = "#A6CE39",   # EH - BILDU
  "MP" = "#0CDCC3",         # MÁS PAÍS
  "Other" = "#808080"       # Other parties
)
# Count the number of municipalities each party won in each election
library(forcats)

party_winners <- winners100k |> 
  group_by(party_code) |> 
  mutate(total_wins = n()) |> 
  mutate(party_grouped = party_code) |>  # Removed the "Other" condition
  ungroup() |> 
  group_by(date_elec, party_grouped) |> 
  summarise(num_municipalities = n(), .groups = "drop")

party_winners <- party_winners |> 
  mutate(party_grouped = fct_reorder(party_grouped, -num_municipalities))

ggplot(party_winners, aes(x = party_grouped, y = num_municipalities, fill = party_grouped)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ date_elec, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = party_colors) +
  labs(
    title = "Winning Parties in Municipalities with >100k Population",
    x = NULL,
    y = "Number of Municipalities Won"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), 
    axis.ticks.x = element_line(),       
    axis.line.y = element_blank(),       
    axis.line.x = element_blank(),       
    panel.grid.major.x = element_blank(),
    legend.position = "none",  
    strip.text = element_text(size = 12, face = "bold")
  )

2. Which party was the second when the first was the PSOE? And when the first was the PP?

Identify the First and Second Parties

ranked_parties <- final_election |> 
  group_by(date_elec, code_municipality) |> 
  arrange(desc(party_ballots)) |> 
  mutate(rank = row_number()) |> 
  filter(rank <= 2) |>  
  ungroup()

Filter when PSOE is first

psoe_first <- ranked_parties |> 
  filter(rank == 1 & party_code == "PSOE") |> 
  left_join(ranked_parties |> filter(rank == 2), by = c("date_elec", "code_municipality")) |> 
  rename(
    first_party = party_code.x,  
    first_votes = party_ballots.x,  
    second_party = party_code.y,  
    second_votes = party_ballots.y,
    population = population.y
  ) |> 
  select(date_elec, code_municipality, population, first_party, first_votes, second_party, second_votes)

Filter when PP is first

pp_first <- ranked_parties %>%
  filter(rank == 1 & party_code == "PP") %>%  
  left_join(ranked_parties %>% filter(rank == 2), by = c("date_elec", "code_municipality")) |> 
  rename(
    first_party = party_code.x,  
    first_votes = party_ballots.x,  
    second_party = party_code.y,  
    second_votes = party_ballots.y,
    population = population.y
  ) |> 
  select(date_elec, code_municipality, population, first_party, first_votes, second_party, second_votes)

Summarized PSOE first

psoe_summary <- psoe_first %>%
  group_by(date_elec, second_party) %>%
  summarise(
    total_votes = sum(second_votes),  
    count_as_second = n(),           
    .groups = "drop"
  ) %>%
  arrange(desc(date_elec)) 
print(psoe_summary)
# A tibble: 39 × 4
   date_elec  second_party total_votes count_as_second
   <date>     <chr>              <dbl>           <int>
 1 2019-11-10 BNG                    0               1
 2 2019-11-10 CS                   707              15
 3 2019-11-10 EAJ-PNV            64750              10
 4 2019-11-10 EH-BILDU            5840               5
 5 2019-11-10 ERC               133515              39
 6 2019-11-10 OTHER              56993             108
 7 2019-11-10 PODEMOS-IU        288459             256
 8 2019-11-10 PP               2293802            2505
 9 2019-11-10 VOX               532008             550
10 2019-04-28 CS               1341613             629
# ℹ 29 more rows
psoe_summary_per_year <- psoe_first %>%
  group_by(date_elec) %>%                                
  filter(second_votes == max(second_votes)) %>%          
  select(date_elec, second_party) %>%      
  arrange(date_elec)  |> 
  print()
# A tibble: 6 × 2
# Groups:   date_elec [6]
  date_elec  second_party
  <date>     <chr>       
1 2008-03-09 CIU         
2 2011-11-20 PP          
3 2015-12-20 PP          
4 2016-06-26 PP          
5 2019-04-28 PP          
6 2019-11-10 PP          

PP was the second party when PSOE was the winner in all the elections.

Summarized PP first

pp_summary <- pp_first %>%
  group_by(date_elec, second_party) %>%
  summarise(
    total_votes = sum(second_votes),  
    count_as_second = n(),           
    .groups = "drop"
  ) %>%
  arrange(desc(date_elec)) 
print(pp_summary)
# A tibble: 43 × 4
   date_elec  second_party total_votes count_as_second
   <date>     <chr>              <dbl>           <int>
 1 2019-11-10 BNG                  191               4
 2 2019-11-10 CS                   291              40
 3 2019-11-10 EAJ-PNV              365               4
 4 2019-11-10 ERC                  140               2
 5 2019-11-10 OTHER              25322              85
 6 2019-11-10 PODEMOS-IU          3875              42
 7 2019-11-10 PSOE             1045161            2124
 8 2019-11-10 VOX               183012             572
 9 2019-04-28 BNG                  207               2
10 2019-04-28 CS                 42963             395
# ℹ 33 more rows
pp_summary_per_year <- pp_first %>%
  group_by(date_elec) %>%                                
  filter(second_votes == max(second_votes)) %>%          
  select(date_elec, second_party) %>%      
  arrange(date_elec)  |> 
  print()
# A tibble: 6 × 2
# Groups:   date_elec [6]
  date_elec  second_party
  <date>     <chr>       
1 2008-03-09 PSOE        
2 2011-11-20 PSOE        
3 2015-12-20 PODEMOS-IU  
4 2016-06-26 PODEMOS-IU  
5 2019-04-28 PSOE        
6 2019-11-10 PSOE        

PSOE was the second party when PP was the winner in almost every election. PODEMOS-IU was the second party in the elections of 2015 and 2016 after PP.

Vis PSOE first

ggplot(psoe_first, aes(y = reorder(date_elec, desc(date_elec)), 
                     x = second_votes, fill = second_party)) +
  
  geom_bar(stat = "identity", position = "identity", width = 0.7) +
  scale_fill_manual(values = c("PP" = "#0157a1", "PODEMOS-IU" = "#663278", "VOX" = "#5AB531", "CS" = "#EB6109")) +
  scale_x_continuous(labels = scales::comma) + 
  theme_minimal() +
  labs(x="", y="", title= "Second party when the winner is PSOE", subtitle = "total n of votes per election") +
   theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0),
    plot.subtitle = element_text(face = "italic", size = 12, hjust = 0),
    legend.position = "bottom",
    legend.title=element_blank(),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10),
    axis.title = element_blank(),
    panel.grid.major.y = element_line(color = "gray", linetype = "dashed", size = 0.3)
  ) 
Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

Second most voted party by population when the first id PSOE

# Step 1: Create population categories with ordered factors
psoe_first <- psoe_first %>%
  mutate(
    population_category = factor(
      case_when(
        population < 10000 ~ "Pueblo Pequeño",
        population >= 10000 & population < 50000 ~ "Pueblo Mediano",
        population >= 50000 & population < 100000 ~ "Pueblo Grande",
        population >= 100000 & population < 500000 ~ "Ciudad Pequeña",
        population >= 500000 & population < 1000000 ~ "Gran Ciudad",
        population >= 1000000 ~ "Metrópolis"
      ),
      levels = c("Pueblo Pequeño", "Pueblo Mediano", "Pueblo Grande", 
                 "Ciudad Pequeña", "Gran Ciudad", "Metrópolis")
    )
  )

# Step 2: Loop through elections and create a plot for each election
unique_dates <- unique(psoe_first$date_elec)
plots <- list()

for (date in  unique_dates) {
  # Ensure `date` is treated as a valid Date object
  current_date <- as.Date(date)
  
  # Filter data for the specific election date
  data_filtered <- psoe_first %>%
    filter(date_elec == current_date) %>%
    group_by(population_category, second_party) %>%
    summarise(
      total_votes = sum(second_votes, na.rm = TRUE),
      .groups = "drop"
    )
  
  # Create the plot
  plot <- ggplot(data_filtered, aes(x = population_category, 
                                    y = total_votes, 
                                    fill = second_party)) +
    geom_bar(stat = "identity", position = "dodge", width = 0.7) +
    scale_fill_manual(values = c(
      "PP" = "#0157a1", 
      "PODEMOS-IU" = "#663278", 
      "VOX" = "#5AB531", 
      "CS" = "#EB6109"
    )) +
    scale_y_continuous(labels = scales::comma) + 
    labs(
      title = paste("Second Party by Population for Election on", format(current_date, "%Y-%m-%d")),  # Format the date properly
      x = "Population Category",
      y = "Total Votes"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
      axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
      axis.text.y = element_text(size = 10),
      legend.position = "bottom",
      legend.title = element_blank()  # Remove legend title
    )
  
  # Save the plot to the list
  plots[[as.character(current_date)]] <- plot
}

# Step 3: Display all plots (one at a time)
for (date in unique_dates) {
  print(plots[[as.character(as.Date(date))]])
}

Vis PP first

ggplot(pp_first, aes(y = reorder(date_elec, desc(date_elec)), 
                     x = second_votes, fill = second_party)) +
  
  geom_bar(stat = "identity", position = "identity", width = 0.7) +
  scale_fill_manual(values = c("PSOE" = "#f20400", "PODEMOS-IU" = "#663278", "VOX" = "#5AB531", "CS" = "#EB6109")) +
  scale_x_continuous(labels = scales::comma) + 
  theme_minimal() +
  labs(x="", y="", title= "Second party when the winner is PP", subtitle = "total n of votes per election") +
   theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0),
    plot.subtitle = element_text(face = "italic", size = 12, hjust = 0),
    legend.position = "bottom",
    legend.title=element_blank(),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10),
    axis.title = element_blank(),
    panel.grid.major.y = element_line(color = "gray", linetype = "dashed", size = 0.3)
  ) 

Second most voted party by population when the first id PSOE

# Step 1: Create population categories with ordered factors for PP
pp_first <- pp_first %>%
  mutate(
    population_category = factor(
      case_when(
        population < 10000 ~ "Pueblo Pequeño",
        population >= 10000 & population < 50000 ~ "Pueblo Mediano",
        population >= 50000 & population < 100000 ~ "Pueblo Grande",
        population >= 100000 & population < 500000 ~ "Ciudad Pequeña",
        population >= 500000 & population < 1000000 ~ "Gran Ciudad",
        population >= 1000000 ~ "Metrópolis"
      ),
      levels = c("Pueblo Pequeño", "Pueblo Mediano", "Pueblo Grande", 
                 "Ciudad Pequeña", "Gran Ciudad", "Metrópolis")
    )
  )

# Step 2: Loop through elections and create a plot for each election for PP
unique_dates_pp <- unique(pp_first$date_elec)
plots <- list()

# Create a list to store plots for PP
plots_pp <- list()

for (date in unique_dates_pp) {
  # Ensure `date` is treated as a valid Date object
  current_date <- as.Date(date)
  
  # Filter data for the specific election date
  data_filtered <- pp_first %>%
    filter(date_elec == current_date) %>%
    group_by(population_category, second_party) %>%
    summarise(
      total_votes = sum(second_votes, na.rm = TRUE),
      .groups = "drop"
    )
  
  # Create the plot
  plot <- ggplot(data_filtered, aes(x = population_category, 
                                    y = total_votes, 
                                    fill = second_party)) +
    geom_bar(stat = "identity", position = "dodge", width = 0.7) +
    scale_fill_manual(values = c(
      "PSOE" = "#f20400", 
      "PODEMOS-IU" = "#663278", 
      "VOX" = "#5AB531", 
      "CS" = "#EB6109"
    )) +
    scale_y_continuous(labels = scales::comma) + 
    labs(
      title = paste("Second Party by Population for Election on", format(current_date, "%Y-%m-%d")),  # Format the date properly
      x = "Population Category",
      y = "Total Votes"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
      axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
      axis.text.y = element_text(size = 10),
      legend.position = "bottom",
      legend.title = element_blank()  # Remove legend title
    )
  
  # Save the plot to the list
  plots_pp[[as.character(current_date)]] <- plot
}

# Step 3: Display all plots for PP (one at a time)
for (date in unique_dates_pp) {
  print(plots_pp[[as.character(as.Date(date))]])
}

For the presentation

pp_first <- pp_first |> 
  mutate(
    population_category = factor(
      case_when(
        population < 10000 ~ "<10.000",
        population >= 10000 & population < 50000 ~ ">= 10.000 & < 50.000",
        population >= 50000 & population < 100000 ~ ">= 50.000 & < 100.000",
        population >= 100000 & population < 500000 ~ ">= 100.000 & < 500.000",
        population >= 500000 & population < 1000000 ~ ">= 500.000 & < 1.000.000",
        population >= 1000000 ~ ">= 1.000.000"
      ),
      levels = c("<10.000", ">= 10.000 & < 50.000", ">= 50.000 & < 100.000", 
                 ">= 100.000 & < 500.000", ">= 500.000 & < 1.000.000", ">= 1.000.000")
    )
  )
data_filtered_2019 <- pp_first |> 
  filter(date_elec == as.Date("2019-04-28")) |> 
  group_by(population_category, second_party) |> 
  summarise(
    total_votes = sum(second_votes, na.rm = TRUE),
    .groups = "drop"
  )
plot_2019 <- ggplot(data_filtered_2019, aes(x = population_category, 
                                            y = total_votes, 
                                            fill = second_party)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  scale_fill_manual(values = c(
    "PSOE" = "#f20400", 
    "PODEMOS-IU" = "#663278", 
    "VOX" = "#5AB531", 
    "CS" = "#EB6109"
  )) +
  scale_y_continuous(labels = scales::comma) + 
  labs(
    title = "Second Party by Population for Election on 2019-04-28",
    x = "Inhabitants per city",
    y = "Total Votes"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 10),
    legend.position = "bottom",
    legend.title = element_blank()  # Remove legend title
  )
print(plot_2019)

3. Who benefits from low turnout?

library(ggplot2)
library(dplyr)
library(forcats)
library(scales)

# Define a color mapping that matches the logo colors of the parties
color_mapping <- c(
  "PSOE" = "#E41A1C",    
  "PP" = "skyblue",      
  "UP-IU" = "#984EA3",   
  "OTHER" = "#999999",   
  "ERC" = "#FF7F00",     
  "PNV" = "#A65628",     
  "BILDU" = "#4DAF4A",   
  "CS" = "#FFFF33",      
  "CIU" = "#66C2A5",     
  "VOX" = "#A6D854",     
  "COMPROMIS" = "#B3B3B3",
  "BNG" = "#8DA0CB",     
  "EH-BILDU" = "pink",
  "PODEMOS-IU" = "#FF33FF", 
  "MP" = "#FFD700",     
  "EAJ-PNV" = "#1B9E77"  
)

turnout2 <- final_election |> 
  mutate(actual_turnout = (total_votes / population) * 100)

# Determine the low turnout threshold (25th percentile)
low_turnout_threshold <- quantile(turnout2$actual_turnout, 0.25, na.rm = TRUE)

low_turnout_data <- turnout2 |> 
  filter(actual_turnout < low_turnout_threshold)

# Summarize party performance in low turnout areas
low_turnout_summary <- low_turnout_data |> 
  group_by(party_code) |> 
  summarise(
    total_ballots = sum(party_ballots, na.rm = TRUE),
    vote_share = (total_ballots / sum(low_turnout_data$party_ballots)) * 100
  ) |> 
  ungroup() |> 
  mutate(party_code = fct_reorder(party_code, vote_share))

ggplot(low_turnout_summary, aes(x = party_code, y = vote_share, fill = party_code)) +
  geom_bar(stat = "identity", color = "black", width = 0.7) +
  labs(
    title = "Vote Share by Party in Low Turnout Areas (<70.7%)",
    subtitle = paste("Threshold for low turnout is", round(low_turnout_threshold, 1), "%"),
    x = "Party",
    y = "Vote Share (%)"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "grey40")
  ) +
  scale_fill_manual(values = color_mapping) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

Additionally, we will examine municipalities with exceptionally low turnout rates (below 0.45) to determine whether the outcomes differ significantly.

# Analyze performance in extreme low turnout areas (<45%)
extreme_low_turnout_threshold <- 45
extreme_low_turnout_data <- turnout2 |> 
  filter(actual_turnout < extreme_low_turnout_threshold)

# Summarize party performance in extreme low turnout areas
extreme_low_turnout_summary <- extreme_low_turnout_data |> 
  group_by(party_code) |> 
  summarise(
    total_ballots = sum(party_ballots, na.rm = TRUE),
    vote_share = (total_ballots / sum(extreme_low_turnout_data$party_ballots)) * 100
  ) |> 
  ungroup() |> 
  mutate(party_code = fct_reorder(party_code, vote_share))

ggplot(extreme_low_turnout_summary, aes(x = party_code, y = vote_share, fill = party_code)) +
  geom_bar(stat = "identity", color = "black", width = 0.7) +
  labs(
    title = "Vote Share by Party in Extreme Low Turnout Areas (<45%)",
    subtitle = "Focus on regions with significantly low turnout rates",
    x = "Party",
    y = "Vote Share (%)"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "grey40")
  ) +
  scale_fill_manual(values = color_mapping) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

total_extreme_low_votes <- extreme_low_turnout_data |> 
  group_by(code_community, date_elec) |> 
  summarise(
    extreme_low_votes = length(code_community), 
    .groups = "drop"
  ) |> 
  summarise(total_extreme_low_votes = sum(extreme_low_votes)) |> 
  pull(total_extreme_low_votes)
total_extreme_low_votes
[1] 948
extreme_low_turnout_data |> 
  group_by(code_community, date_elec) |> 
  summarise(extreme_low_votes = length(code_community),.groups = "drop" ) |> 
  pivot_wider(names_from = date_elec,
              values_from = extreme_low_votes) 
# A tibble: 6 × 7
  code_community `2015-12-20` `2011-11-20` `2019-04-28` `2019-11-10`
  <chr>                 <int>        <int>        <int>        <int>
1 02                       12           NA           NA           NA
2 07                       NA           12           NA           NA
3 08                       NA           NA           36           36
4 09                       NA           NA           NA           NA
5 13                       12           NA           NA           36
6 14                       NA           NA           NA           NA
# ℹ 2 more variables: `2016-06-26` <int>, `2008-03-09` <int>

4. How to analyze the relationship between census and vote? Is it true that certain parties win in rural areas?

In this section, to look at the relationship between population and vote % for each party, we use log(population) to normalise the distribution a bit. Further, we avoid using raw party vote counts because those are always positively correlated with population. We create a vote % function to check for trends.

#Less than 10,000 are rural constituencies and more than or equal to 10,000 are urban constituencies:
final_election2 <- final_election |> 
  mutate(rural = ifelse(population < 10000, 1, 0), # 1 is rural, 0 is urban
         vote_pct = party_ballots/valid_votes)  

# visualise the overall trend - use log population to normalise  
ggplot(final_election2, aes(x = log(population), y = vote_pct)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, colour = "red") + # Add trend lines
  labs(title = "Votes vs. Population by Party",
       x = "Population log",
       y = "Pct of votes for party") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

# Use regression to confirm there is no significant relationship between census and vote
model <- lm(vote_pct ~ log(population), data = final_election2)
summary(model)

Call:
lm(formula = vote_pct ~ log(population), data = final_election2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.08335 -0.08331 -0.08330  0.01910  0.91672 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     8.327e-02  7.221e-04 115.330   <2e-16 ***
log(population) 5.365e-06  1.096e-04   0.049    0.961    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1515 on 584842 degrees of freedom
Multiple R-squared:  4.096e-09, Adjusted R-squared:  -1.706e-06 
F-statistic: 0.002396 on 1 and 584842 DF,  p-value: 0.961
# Now breakdown for party - check for each party if a trend in vote % received visually with log population. This seems to show that PP perform best in small areas (negatively correlated with population), while PSOE, OTHER and PODEMUS-IU perform better in higher population areas (positive correlation). 
ggplot(final_election2, aes(x = log(population), y = vote_pct)) +
  geom_point(alpha=0.2) +
  geom_smooth(method = "lm", se = TRUE, aes(color = party_code, group = party_code)) + # Add trend lines
  facet_wrap(~party_code)+
  labs(title = "Votes vs. Population by Party",
       x = "Population log",
       y = "Pct of votes for party") +
  theme_minimal() +
  theme(legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'

# use regresison to compare the interaction effects on rural areas and party code, and how that impacts vote percentages.
rural_compare <- lm(vote_pct ~ rural*party_code, data = final_election2)
summary(rural_compare)

Call:
lm(formula = vote_pct ~ rural * party_code, data = final_election2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.37423 -0.03514 -0.01086 -0.00035  0.98914 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 0.006540   0.001664   3.931 8.45e-05 ***
rural                      -0.004044   0.001726  -2.343  0.01914 *  
party_codeCIU               0.006892   0.002353   2.929  0.00340 ** 
party_codeCS                0.078273   0.002353  33.270  < 2e-16 ***
party_codeEAJ-PNV           0.010041   0.002353   4.268 1.98e-05 ***
party_codeEH-BILDU          0.001941   0.002353   0.825  0.40946    
party_codeERC               0.019535   0.002353   8.303  < 2e-16 ***
party_codeMP               -0.004568   0.002353  -1.942  0.05219 .  
party_codeOTHER             0.069790   0.002353  29.664  < 2e-16 ***
party_codePODEMOS-IU        0.128137   0.002353  54.464  < 2e-16 ***
party_codePP                0.286444   0.002353 121.752  < 2e-16 ***
party_codePSOE              0.286215   0.002353 121.655  < 2e-16 ***
party_codeVOX               0.038336   0.002353  16.294  < 2e-16 ***
rural:party_codeCIU         0.005744   0.002441   2.353  0.01864 *  
rural:party_codeCS         -0.020329   0.002441  -8.327  < 2e-16 ***
rural:party_codeEAJ-PNV    -0.001671   0.002441  -0.685  0.49359    
rural:party_codeEH-BILDU    0.006305   0.002441   2.582  0.00981 ** 
rural:party_codeERC         0.006913   0.002441   2.832  0.00463 ** 
rural:party_codeMP          0.002423   0.002441   0.992  0.32102    
rural:party_codeOTHER      -0.001797   0.002441  -0.736  0.46165    
rural:party_codePODEMOS-IU -0.037962   0.002441 -15.549  < 2e-16 ***
rural:party_codePP          0.085288   0.002441  34.934  < 2e-16 ***
rural:party_codePSOE        0.004051   0.002441   1.659  0.09706 .  
rural:party_codeVOX        -0.000234   0.002441  -0.096  0.92364    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0981 on 584820 degrees of freedom
Multiple R-squared:  0.5805,    Adjusted R-squared:  0.5805 
F-statistic: 3.519e+04 on 23 and 584820 DF,  p-value: < 2.2e-16
#Counting the total number of votes for each political party in urban and rural areas
# count the proportion of votes received in each area, then compare for statistically significant differences.
party_votes <- final_election2 |> 
  group_by(rural, party_code) |> 
  summarise(total_votes = sum(party_ballots, na.rm = TRUE), .groups = "drop") |> 
  arrange(rural, desc(total_votes))
party_votes
# A tibble: 24 × 3
   rural party_code total_votes
   <dbl> <chr>            <dbl>
 1     0 PP            32959071
 2     0 PSOE          30976694
 3     0 PODEMOS-IU    15494030
 4     0 CS             9923037
 5     0 OTHER          7747105
 6     0 VOX            4897596
 7     0 ERC            2512782
 8     0 EAJ-PNV        1409088
 9     0 CIU            1231905
10     0 EH-BILDU        604590
# ℹ 14 more rows
#Visualizing the performance of urban and rural political parties:
ggplot(party_votes, aes(x = party_code, y = total_votes, fill = as.factor(rural))) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Proportion of votes for parties in Rural vs Urban Areas",
       x = "Party",
       y = "Total Votes",
       fill = "Rural status  \n(1 = Rural, 0 = Urban)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

#Two-way ANOVA to examine urban-rural and partisan interactions
final_election2$rural <- as.factor(final_election2$rural)
final_election2$party_code <- as.factor(final_election2$party_code)

#Ex post facto tests: further analysis of partisan differences. It finds that the differences are statistically different between groups for rural*party code interaction
anova_model <- aov(vote_pct ~ rural*party_code, data = final_election2)
summary(anova_model)
                     Df Sum Sq Mean Sq   F value Pr(>F)    
rural                 1      0     0.0     0.001  0.974    
party_code           11   7758   705.3 73293.834 <2e-16 ***
rural:party_code     11     29     2.7   276.967 <2e-16 ***
Residuals        584820   5628     0.0                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5. How to calibrate the error of the polls (remember that the polls are voting intentions at national level)?

survey_q5 <- final_surveys |> 
  summarise(avg_voting_intention = mean(votes), .by = c(date_elec, party_code))

# The means do not add up to 1 (sum surveys just do not add up to 1)
# survey_q5 |> 
#   summarise(sum(avg_vote), .by = date_elec)

election_q5 <- final_election |>
  summarise(party_votes = sum(party_ballots), .by = c(date_elec, party_code))

turnout_q5 <- turnout |> 
  summarise(tot_votes = sum(total_votes), 
            valid_votes = sum(valid_votes + blank_votes), 
            null_votes = sum(null_votes),
            tot_pop = sum(population),
            .by = c(date_elec)) |> 
  mutate(turnout = (tot_votes/tot_pop)*100)

election_q5 <- election_q5 |> 
  left_join(turnout_q5, by = join_by(date_elec)) |> 
  mutate(share_votes = (party_votes/valid_votes)*100)

final_q5 <- survey_q5 |> 
  left_join(election_q5, by = join_by(date_elec, party_code)) |> 
  select(-tot_votes, -tot_pop, -party_votes) |> 
  mutate(polling_error = avg_voting_intention - share_votes)

polling_err <- final_q5 |> 
  summarise(avg_err = mean(polling_error), .by = party_code)

ggplot(polling_err, aes(x = reorder(party_code, avg_err), y = avg_err, fill = factor(party_code))) +
  geom_bar(stat = "identity") +  
  scale_fill_manual(values = party_colors) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(
    title = "Polling Error by Political Party",
    subtitle = "Average polling error for elections between 2008 and 2019 by main political party",
    x = "Party",
    # y = "Average Polling Error")
    y = paste("\u2190 Underestimated", strrep(" ", 35), "Overestimated \u2192")) +
  scale_y_continuous(limits = c(-5,3), 
                     expand = expansion(add = c(0,0)),
                     breaks = c(-4, -2, 0, 2)) +
  theme_minimal() +
  theme(
    legend.position = NULL,
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    axis.text.x = element_text(size = 5), 
    panel.grid.minor = element_blank(),
    axis.title.y = element_text(size = 8)) +
  guides(fill = guide_legend(title = "Political Party"))

6. Which polling houses got it right the most and which ones deviated the most from the results?

Before starting with this question, it’s interesting to discuss how to measure the goodness of prediction of one pollsters.

In our view, this perspective is somewhat narrow if it is only based on accurately predicting the winning party (or the one with the most votes) in each case: studies in this assignment have found that there have been a number of recent instances in Spain where large parties with more votes have not been able to form an independent government, and that large political parties have often needed to collaborate with smaller parties.

However, it is also inappropriate to consider predictions of results for all parties (hundreds) as equally important, as many small parties end up not even getting a single parliamentary seat.

Therefore, our measure in this question is to divide the parties into two parts: Assign a weight of 0.7 to the top five parties receiving the most votes in each general election, and a weight of 0.3 to the remaining parties.

In this way, the top five parties with the most votes get more weight than the other small parties, but we do not exclude the small parties completely from our accuracy assessment.

Haven defined the criteria, firstly, we will summarise the prediction of each pollster for each party in each of these elections, as well as the party ballots rate in every election. Then, we join these two tables with the key date_elec_party_code for later calculation.

pollster_estimate <- final_surveys |> 
  select(date_elec, pollster, party_code, votes) |> 
  mutate(date_elec_party_code = paste(date_elec, party_code, sep = "_")) |> 
  select(date_elec_party_code, pollster, votes)

final_election_summary <- final_election |> 
  group_by(date_elec, party_code) |> 
  reframe(sum_ballots = sum(party_ballots)) 

total_votes_summary<- final_election |> 
  dplyr::select(-party_main, -party_code, -party_ballots) |> 
  distinct() |> 
  group_by(date_elec) |> 
  summarise(sum_total_votes=sum(valid_votes))

final_election_summary<- final_election_summary |> 
  left_join(total_votes_summary, by=c("date_elec"="date_elec")) |> 
  mutate(national_share = sum_ballots/sum_total_votes) |> 
  mutate(date_elec_party_code = paste(date_elec, party_code, sep = "_")) |> 
  left_join(pollster_estimate, by=c("date_elec_party_code"="date_elec_party_code")) |> 
  dplyr::select(-date_elec_party_code) 

Now we will calculate the absolute error of pollsters predictions comparing with real result, assigning weight of 0.7 to the first 5 large parties in each election, and 0.3 to the remaining parties.

final_election_summary_v2 <- final_election_summary |> 
  group_by(date_elec) |>  
  arrange(date_elec, desc(national_share)) |>  
  mutate(rank_pos = dense_rank(-national_share)) |>  
  mutate(weight = if_else(rank_pos <= 5, 0.7, 0.3)) |>  
  ungroup()

Then, we can now calculate the weighted MAE of each pollster:

final_election_summary_v3 <- final_election_summary_v2 |> 
  mutate(national_share=national_share*100) |> 
  mutate(error = abs(votes - national_share)) 
  
wmae <- final_election_summary_v3 |> 
  group_by(pollster) |> 
  summarise(WMAE = sum(weight * error) / sum(weight)) |> 
  ungroup() |> 
  mutate(pollster = fct_reorder(pollster, WMAE, .desc = FALSE))

We can obtain the 5 pollsters that predict the best and the 5 that predict the worst.

head(wmae$pollster[order(wmae$WMAE)], 5)
[1] IMOP         SOCIOMÉTRICA APPEND       METRA SEIS   VOX PÚBLICA 
46 Levels: IMOP SOCIOMÉTRICA APPEND METRA SEIS VOX PÚBLICA ... NETQUEST
tail(wmae$pollster[order(wmae$WMAE)], 5)
[1] DYM           SIMPLE LÓGICA METROSCOPIA   MYWORD        NETQUEST     
46 Levels: IMOP SOCIOMÉTRICA APPEND METRA SEIS VOX PÚBLICA ... NETQUEST

Here we will visualize the performance of these pollsters:

ggplot(wmae, aes(x = pollster, y = WMAE, fill = WMAE)) +
  geom_col() +  
  theme_minimal(base_size = 10) +
  theme(
    panel.grid.major = element_line(color = "grey80", size = 0.5),  
    panel.grid.minor = element_blank(), 
    axis.text.x = element_text(angle = 65, hjust = 1, size = 8),
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "bold", size = 12),
    axis.title.y = element_text(face = "bold", size = 10)
  ) +
  labs(
    title = "Weighted Average Polling Error by Pollster",
    subtitle = "Comparison of polling errors across different organizations",
    x = NULL,
    y = "Weighted Average Error"
  ) +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_gradient2(low = "green", mid = "#ffc403", high = "#c70b1f", midpoint = 3.2) 

Additional questions

8. How has the turnout rate changed over time? And within each election year, how are turnout rates correlated with the municipalities’ populations?

question8<- final_election |> 
  group_by(date_elec,municipality) |> 
  summarize(population=mean(population),
            total_votes=mean(total_votes),
            turnout_rate=(total_votes/population),
            .groups = "drop") 
turnout_over_time <- question8 |> 
  group_by(date_elec) |> 
  summarize(avg_turnout = (sum(total_votes) / sum(population)),
                           na.rm = TRUE, 
                           .groups = "drop")

ggplot(turnout_over_time, aes(x = date_elec, y = avg_turnout)) +
  geom_line(color = "#ffc403", size = 2) +  
  geom_point(color = "#c70b1f", size = 3) +  
  geom_text_repel(aes(label = scales::percent(avg_turnout, accuracy = 0.01)),
                  size = 3, 
                  box.padding = 0.5,
                  segment.color = NA,
                  color="#c70b1f") +
  labs(
    title = "Change in Turnout Rate Over Time",
    x = "Election Date",
    y = "Average Turnout Rate"
  ) +
  scale_y_continuous(
    limits = c(0.6, 0.8), 
    labels = percent_format(scale = 100) 
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12)
  )

We can observe that there is no clear tendency of turnout rate. However, it is interesting seeing that these two ‘snap’ elections in 2016 and 2019 were held shortly after the previous elections, with a very brief gap between each one and its predecessor. Both elections were triggered by the inability of any party to secure an absolute majority, leading to the call for a second ‘emergency’ election. However, the voter turnout in both of these elections was significantly lower compared to the previous ones, suggesting that the nature of such snap elections may reduce citizens’ willingness to vote again.

And apart from the election year and specific election context, how are turnout rates correlated with the municipalities’ populations within each election year?

Since the population of municipalities ranges from very small to extremely large, with some major cities having populations hundreds of times greater than small villages, we will apply a logarithmic transformation to the population for a more effective regression analysis.

cor_value_overall <- cor(question8$population, question8$turnout_rate)
lm_model_overall <- lm(turnout_rate ~ log(population), data = question8)
slope_overall <- coef(lm_model_overall)[2]
p_value_overall <- summary(lm_model_overall)$coefficients[2, 4]

calculate_slope_and_cor <- function(data) {
  data <- data %>%
    mutate(log_population = log(population))
  cor_value <- cor(data$log_population, data$turnout_rate, use = "complete.obs")
  lm_model <- lm(turnout_rate ~ log_population, data = data)
  slope <- coef(lm_model)[2]
  return(data.frame(cor_value = cor_value, slope = slope))
}

slope_cor_values <- question8 %>%
  group_by(date_elec) %>%
  do(calculate_slope_and_cor(.)) %>%
  ungroup()


ggplot(question8, aes(x = log(population), y = turnout_rate)) +
  geom_point(color = "#ffc403") +  
  geom_smooth(method = "lm", color = "#c70b1f", linetype = "dashed") +  
  facet_wrap(~ date_elec) +
  geom_text(data = slope_cor_values, 
            aes(x = max(log(question8$population)), y = max(question8$turnout_rate), 
                label = paste("r = ", round(cor_value, 2), "\nSlope = ", round(slope, 4))), 
            color = "#c70b1f", size = 2, hjust = 1, vjust = 1) +
  theme_minimal() +
  labs(title = "Correlation between log of Population\nand Turnout rate over time",
       subtitle= paste("Combining election results from multiple years, \nthe overall correlation is ", round(cor_value_overall, 2),"\nthe overall slope is ", round(slope_overall, 4), "with a p-value ", p_value_overall, "."))+
   theme(
    plot.title = element_text(face = "bold", hjust = 0, size = 11),
    plot.subtitle = element_text(hjust = 0, size = 9, color = "grey40"),
    axis.title = element_text(size = 10),
    aspect.ratio = 1.2
  )
`geom_smooth()` using formula = 'y ~ x'

We can observe that the correlation is always negative during this period, the slope is around -0.01 for log(population), and its overall p-value is very close to 0, that is, statistically very significant over time.

The slope -0.01 can be interpreted in this way: If the population doubles, the turnout rate in this municipality will decrease by approximately 0.69%.

In conclusion, although there is no clear tendency of turnout rate evolution over time, we do observe two important insights:

  1. The nature of “snap” elections like in 2016 and 2019, as a result of the inability of any party to secure an absolute majority, may reduce citizens’ willingness to vote again.
  2. The population size has a negative effect on turnout rate: if the population doubles, the turnout rate in this municipality will decrease by approximately 0.69%.

9. How has electoral support for smaller parties evolved over time?

The maps show a significant increase in support for smaller parties over time by autonomous community, with darker regions indicating higher percentages. Madrid, for example, rises from around 10% in 2008 to a peak of 54% in April 2019. Northern communities including Catalonia, Navarra, and the Basque Country consistently show high levels of support for small parties.

The April 2019 election marked the height of support for smaller parties overall, but the November 2019 election, a re-vote triggered by the failure to form a majority coalition, shows a slight decline in their support, as reflected in the lighter shading across many regions.

# Load necessary libraries
library(mapSpain)
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(ggplot2)
library(dplyr)
library(ggiraph)

# Calculate % (PSOE + PP) and % other parties for each community and date
party_percentage <- final_election |> 
  group_by(code_community, date_elec) |> 
  summarize(
    total_votes = sum(party_ballots, na.rm = TRUE),
    PSOE_PP_votes = sum(party_ballots[party_code %in% c("PSOE", "PP")], na.rm = TRUE),
    other_votes = total_votes - PSOE_PP_votes
  ) |> 
  mutate(
    PSOE_PP_percent = (PSOE_PP_votes / total_votes) * 100,
    other_percent = (other_votes / total_votes) * 100
  ) |> 
  ungroup()
`summarise()` has grouped output by 'code_community'. You can override using
the `.groups` argument.
# Get map and simplify geometry
spain_map <- esp_get_ccaa() |> 
  st_simplify(dTolerance = 0.01)

# Community mapping with names
community_map <- tibble(
  code_community = c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13", "14", "15", "16", "17"),
  codauto = c("01", "02", "03", "04", "05", "06", "08", "07", "09", "11", "12", "13", "15", "16", "14", "17", "10"),
  community_name = c(
    "Andalusia", "Aragon", "Asturias", "Balearic Islands", "Canary Islands", "Cantabria",
    "Castile and Leon", "Castile-La Mancha", "Catalonia", "Extremadura", "Galicia",
    "Madrid", "Navarra", "Basque Country", "Murcia", "La Rioja", "Valencia"
  )
)

# Add codauto and community names to your electoral data
party_percentage <- party_percentage |> 
  left_join(community_map, by = "code_community")

# Merge spatial data with electoral data using codauto
map_data <- spain_map |> 
  left_join(party_percentage, by = "codauto") |> 
  drop_na(date_elec)

# Create the ggplot map
gg_map <- ggplot(map_data) +
  geom_sf_interactive(
    aes(
      fill = other_percent, # Correct fill variable
      geometry = geometry,
      tooltip = paste(
        "Region: ", community_name,
        "<br>% Other Votes: ", round(other_percent, 2), "%"
      ),
      data_id = codauto
    ),
    color = "white"
  ) +
  facet_wrap(~ date_elec) +
  scale_fill_gradient(low = "white", high = "darkblue", name = "% Other Votes") +
  labs(
    title = "Trends in Electoral Support for Non-Major Parties by Region",
    subtitle = "Where 'Other' Excludes PP and PSOE",
    fill = "% Other Votes"
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    strip.text = element_text(size = 8)
  )

# Convert ggplot to an interactive plot with ggiraph
interactive_map <- girafe(
  ggobj = gg_map,
  width_svg = 10,
  height_svg = 6
)

# Optional: Adjust tooltip style
interactive_map <- interactive_map |> 
  girafe_options(
    opts_tooltip(css = "background-color: white; color: black; border: 1px solid black; padding: 5px;"),
    opts_hover(css = "fill: red")
  )

# Display the interactive map
interactive_map